Abstract

How does the brain recognize three-dimensional objects? An initial step towards the understanding of the 
neural substrate of visual object recognition can be taken by studying first the nature of object representa- 
tion, as manifested in behavioral studies with humans or non-human primates. One fundamental question 
is whether these representations are object or viewer centered. We trained monkeys to recognize computer 
rendered objects presented from an arbitrarily chosen training view, and subsequently tested their ability to 
generalize recognition for views generated by mathematically rotating the objects around any arbitrary axis. 
In agreement with human psychophysical work (Rock and DiVita, 1987; Bülthoff and Edelman, 1992), our
results show that recognition at the subordinate level becomes increasingly difficult for the monkey as the 
stimulus is rotated away from a familiar attitude, and thus provide additional evidence in favor of memorial 
representations that are viewer-centered. When the animals were trained with as few as three views of the 
object, 120° apart, they could often interpolate recognition for all views resulting from rotations around the 
same axis. The possibility thus exists that even in the case of a viewer-centered recognition system, a small 
number of stored views may suffice to achieve the view-invariant performance that humans and non-human 
primates typically achieve when recognizing familiar objects. These results are also in agreement with a 
recognition model that accomplishes view-invariant performance by storing a limited number of object views 
or templates together with the capacity to interpolate between the templates (Poggio and Edelman, 1990). 
In such a model, the units involved in representing a learned view are expected to exhibit a bell-shaped
tuning curve centered around the learned view, while interpolation is instantiated in the summed activity 
of the units. 
1 Introduction

Most theories of object recognition assume that the vi- 
sual system stores a representation of an object and 
that recognition occurs when this stored representation 
is matched to its corresponding sensory representation 
generated from the viewed object [28]. What is, how- 
ever, the nature of these representations, what is stored 
in memory, and how is matching achieved? A space of 
possible representations could be characterized by ad- 
dressing the issues of (1) the recognition task, (2) the 
attributes to be represented, (3) the nature of primitives 
that would describe these attributes, and (4) the spatial 
reference frame with respect to which the object is defined.

Representations may vary for different recognition 
tasks. A fundamental task for any recognition system 
is to cut up the environment into categories the mem- 
bers of which, although nonidentical, are conceived of 
as equivalent. Such categories often relate to each other 
by means of class inclusion, forming taxonomies. Ob- 
jects are usually recognized first at a particular level of 
abstraction, called the basic level [25]. For example, a 
Golden Retriever is more likely to be first perceived as
a dog, rather than as a retriever or a mammal. Classi- 
fications at the basic level carry the highest amount of 
information about a category and are usually character- 
ized by distinct shapes [25]. Classifications above the 
basic level, superordinate categories, are more general, 
while those below the basic level, subordinate categories, 
are more specific, sharing a great number of attributes 
with other subordinate categories, and having to a large 
extent similar shape (for a thorough discussion of cate- 
gories see [8,24,25]). Representations of objects at differ- 
ent taxonomic levels may differ in their attributes, the 
nature of primitives describing various attributes, and 
the reference frame used for the description of the ob- 
ject. 

In primate vision, shape seems to be the critical at- 
tribute for object recognition. Material properties, such 
as color or texture, may be important primarily at the
most subordinate levels. Recognition of objects is typi- 
cally unaffected in gray-scale photographs, line drawings, 
or in cartoons with wrong color and texture information. 
An elephant, for instance, would be recognized as an ele- 
phant, even if it were painted yellow and textured with 
blue spots. Evidence as to the importance of shape for 
object perception comes also from clinical studies show- 
ing that the breakdown of recognition, resulting from 
circumscribed damage to the human cerebral cortex, is 
most marked at the subordinate level, at which the great- 
est shape similarities occur [5]. 

Models of recognition differ in the spatial frame 
used for shape representation. Current theories using 
object-centered representations assume either a com- 
plete three-dimensional description of an object [28], or 
a structural description of the image specifying the
relationships among viewpoint-invariant volumetric
primitives [1,12]. In contrast, viewer-centered representations
model three-dimensional objects as a set of 2D views, 
or aspects, and recognition consists of matching image 
features against the views in this set. 

When tested against human behavior, object-centered 
representations predict well the view-independent recog- 
nition of familiar objects [1]. However, psychophys- 
ical studies using familiar objects to investigate the 
processes underlying object constancy, i.e. viewpoint-invariant
recognition of objects, can be misleading because a
recognition system based on 3D descriptions cannot easily
be distinguished from a viewer-centered system
exposed to a sufficient number of object views. Further- 
more, object-centered representations fail to account for 
performance in recognition tasks with various kinds of 
novel objects at the subordinate level [4,6,18,19,27]. 

Viewer-centered representations, on the other hand, 
can account for recognition performance at any taxonomic
level, but they have often been considered implausible
due to the vast amount of memory required
to store all discriminable object views needed to achieve 
viewpoint invariance. Yet, recent theoretical work shows 
that a simple network can achieve viewpoint invariance 
by interpolating between a small number of stored views 
[16]. Computationally, this network uses a small set of 
sparse data corresponding to an object's training views 
to synthesize an approximation to a multivariate func- 
tion representing the object. The approximation tech- 
nique is known by the name of Generalized Radial Basis 
Functions (GRBFs), and it has been shown to be math- 
ematically equivalent to a multilayer network [17]. A 
special case of such a network is that of the Radial Basis 
Functions (RBFs) that can be conceived of as "hidden- 
layer" units, the activity of which is a radial function of 
the disparity between a novel view and a template stored 
in the unit's memory. Such an interpolation-based net- 
work makes both psychophysical and physiological pre- 
dictions [15] that can be directly tested against behav- 
ioral performance and single cell activity. 

In the experiments described below, we trained mon- 
keys to recognize novel objects presented from one view, 
and subsequently tested their ability to generalize recog- 
nition for views generated by mathematically rotating 
the objects around arbitrary axes. The stimuli, exam- 
ples of which are shown in Figure 1, were similar to 
those used by Edelman and Bülthoff (1992) [6] in human
psychophysical experiments. Our aim was to examine
whether non-human primates show viewpoint invariance
at the subordinate level of recognition. Brief
reports of these experiments have been published previ- 
ously [10,11]. 

2 Materials and Methods 

2.1 Subjects and Surgical Procedures 

Three juvenile rhesus monkeys (Macaca mulatta) weighing
7-9 kg were tested. The animals were cared for in
accordance with the National Institutes of Health Guide,
and the guidelines of the Animal Protocol Review Com- 
mittee of the Baylor College of Medicine. 

The animals underwent surgery for the placement
of a head restraint post, and a scleral-search eye coil 
[9] for measuring eye movements. The monkeys were 
given antibiotics (Tribrissen 30 mg/kg) and analgesics 
(Tylenol 10 mg/kg) orally one day before the operation. 
The surgical procedure was carried out under strictly 
aseptic conditions while the animals were anesthetized 
with isoflurane (induction 3.5% and maintenance 1.2% 
- 1.5%, at 0.8 L/min Oxygen). Throughout the surgi- 
cal procedure the animals received 5% dextrose in lac- 
tated Ringer's solution at a rate of 15 ml/kg/hr. Heart 
rate, blood pressure and respiration were monitored con- 
stantly and recorded every 15 minutes. Body tempera- 
ture was kept at 37.4 degrees Celsius using a heating 
pad. Postoperatively, an opioid analgesic was administered
(Buprenorphine hydrochloride 0.02 mg/kg, IM)
every 6 hours for one day. Tylenol (10 mg/kg) and an- 
tibiotics (Tribrissen 30 mg/kg) were given to the animal 
for 3-5 days after the operation. 

2.2 Animal Training 

Standard operant conditioning techniques with positive 
reinforcement were used to train the monkey to perform 
the task. Initially, the animals were trained to recognize 
the target's zero view among a large set of distractors, 
and subsequently were trained to recognize additional 
target views resulting from progressively larger rotations 
around one axis. After the monkey learned to recog- 
nize a given object from any viewpoint in the range of 
±90°, the procedure was repeated with a new object. In 
the early stages of training several days were required 
to train the animals to perform the same task for a new 
object. Four months of training were required on average
for the monkey to learn to generalize the task across
different types of objects of one class, and about six months
were required for the animal to generalize across different
object classes.

Within an object class the similarity of the targets 
to the distractors was gradually increased, and in the fi- 
nal stage of the experiments distractor wire-objects were 
generated by adding different degrees of positional or ori- 
entation noise to the target objects. A criterion of 95% 
correct for several objects was required to proceed with 
the psychophysical data collection. 

In the early phase of the animal's training a reward 
followed each correct response. In the later stages of the 
training the animals were reinforced on a variable-ratio 
schedule which administered a reward after a specified 
average number of correct responses had been given. Fi- 
nally, in the last stage of the behavioral training the 
monkey was rewarded only after ten consecutive correct 
responses. The end of the observation period was signalled
with a full-screen, green light and a juice reward
for the monkey.

During the behavioral training, independent of the re- 
inforcement schedule, the monkey always received feed- 
back as to the correctness of its response. One incorrect 
report aborted the entire observation period. During the 
psychophysical data collection, on the other hand, the 
monkey was presented with novel objects and no feed- 
back was given during the testing period. The behav- 
ior of the animals was continuously monitored during 
the data collection by computing on-line hit rate and 
false alarms. To discourage arbitrary performance or 
the development of hand-preferences, e.g. giving only 
right hand responses, sessions of data collection were 
randomly interleaved with sessions with novel objects, 
in which incorrect responses aborted the trial. 

2.3 Visual Stimuli 

Wire-like and spheroidal objects were generated mathe- 
matically and presented on a color monitor (Figure 1). 
The selection of the vertices of the wire objects within 
a three-dimensional space was constrained to exclude 
intersection of the wire-segments and extremely sharp 
angles between successive segments, and to ensure that 
the difference in the moment of inertia between different 
wires remained within a limit of 10%. Once the vertices 
were selected the wire objects were generated by deter- 
mining a set of rectangular facets covering a hypothetical 
surface of a tube of a given radius that joined successive 
vertices. 
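
The vertex-selection constraints described above can be illustrated with a minimal rejection-sampling sketch. Everything specific in it is an assumption made for illustration and is not taken from the stimulus software used in the experiments: the function names, the 30 degree minimum bend angle, the cube from which vertices are drawn, and the use of the summed squared distance from the centroid as a stand-in for the moment of inertia. The intersection test for wire segments is omitted for brevity.

```python
import math
import random


def interior_angle(p_prev, p, p_next):
    """Angle in degrees at vertex p between the two segments that meet there."""
    u = [a - b for a, b in zip(p_prev, p)]
    v = [a - b for a, b in zip(p_next, p)]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def inertia_proxy(vertices):
    """Sum of squared vertex distances from the centroid (unit mass per vertex).

    Assumption: used here only as a simple stand-in for the moment of inertia.
    """
    n = len(vertices)
    centroid = [sum(p[k] for p in vertices) / n for k in range(3)]
    return sum(math.dist(p, centroid) ** 2 for p in vertices)


def random_wire(rng, n_vertices=8, min_angle=30.0, target_inertia=None,
                tolerance=0.10, max_tries=100000):
    """Rejection-sample vertices until the angle and inertia constraints hold."""
    for _ in range(max_tries):
        verts = [[rng.uniform(-1.0, 1.0) for _ in range(3)]
                 for _ in range(n_vertices)]
        if any(interior_angle(verts[i - 1], verts[i], verts[i + 1]) < min_angle
               for i in range(1, n_vertices - 1)):
            continue  # a bend between successive segments is too sharp
        if target_inertia is not None:
            if abs(inertia_proxy(verts) - target_inertia) > tolerance * target_inertia:
                continue  # differs from the reference object by more than 10%
        return verts
    raise RuntimeError("no wire satisfied the constraints")


rng = random.Random(0)
reference = random_wire(rng)
similar = random_wire(rng, target_inertia=inertia_proxy(reference))
```

Rejection sampling is chosen here only because it keeps the constraints explicit; any generator that enforces the same limits would serve equally well.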

The spheroidal objects were created through the gen- 
eration of a recursively-subdivided triangle mesh ap- 
proximating a sphere. Protrusions were generated by 
randomly selecting a point on the sphere surface and 
stretching it outward. Smoothness was accomplished by 
increasing the number of triangles forming the polyhe- 
dron that represents one protrusion. Spheroidal stimuli 
were characterized by the number, sign (negative sign 
corresponded to dimples), size, density and sigma of 
the Gaussian-type protrusions. Similarity was varied by
changing these parameters as well as the overall size of 
the sphere. 
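
One way to picture the Gaussian-type protrusions is as radial displacements of the sphere surface that fall off with angular distance from the center of each protrusion. The sketch below makes that assumption explicit; the function names, the additive combination of protrusions, and the example parameter values are hypothetical and are not taken from the stimulus software. Mesh subdivision is not shown.

```python
import math


def angular_distance(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def radius_at(direction, bumps, base_radius=1.0):
    """Surface radius along a unit direction.

    Each bump is (center, amplitude, sigma): a unit vector, a signed height
    (negative values give dimples), and an angular width in radians.
    """
    r = base_radius
    for center, amplitude, sigma in bumps:
        theta = angular_distance(direction, center)
        r += amplitude * math.exp(-theta ** 2 / (2.0 * sigma ** 2))
    return r


# Hypothetical object: one protrusion at the pole, one dimple on the equator.
bumps = [((0.0, 0.0, 1.0), 0.5, 0.3), ((1.0, 0.0, 0.0), -0.2, 0.2)]
pole_radius = radius_at((0.0, 0.0, 1.0), bumps)
```

Under this parameterization, varying the number, sign, amplitude, and sigma of the entries in `bumps` changes the similarity between two objects in a graded way.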

3 Results 

3.1 Viewpoint-Dependent Recognition 
Performance 

Three monkeys and two human subjects participated in 
this experiment yielding similar results. Only the mon- 
key data are presented in this paper. The animals were 
trained to recognize any given object viewed on one oc- 
casion in one orientation, when presented on a second 
occasion in a different orientation. Technically, this is 
a typical recognition, "old-new" task, whereby the subject's
ability to retain stimuli to which it has been exposed
is tested by presenting those stimuli intermixed
with other objects never before encountered. The subject
is required to state for each stimulus whether it is
"old", i.e. familiar, or "new", i.e. never seen before. This
type of task is similar to the yes-no task of detection in 
psychophysics and can be studied under the assumptions 
of the signal detectability theory [7,13]. 

Figure 2a describes the sequence of events in a single 
observation period. Successful fixation of a central light 
spot was followed by the learning phase, during which 
the monkeys were allowed to inspect an object, the tar- 
get, from a given viewpoint, arbitrarily called the zero 
view. To provide the subject with 3D structure information,
the target was presented as a motion sequence
of 10 adjacent, Gouraud-shaded views, 2° apart, cen- 
tered around the zero view. The animation was accom- 
plished at a 2 frames-per-view temporal rate, i.e. each 
view lasted 33.3 msec, yielding the impression of an ob- 
ject oscillating slowly ±10° around a fixed axis. 

The learning phase was followed by a short fixation 
period after which the testing phase started. Each test- 
ing phase consisted of up to 10 trials. The beginning 
of a trial was indicated by a low-pitched tone, immedi- 
ately followed by the presentation of the test stimulus, 
a shaded, static view of either the target or a distractor.
Target views were generated by rotating the object
around one of four axes, the vertical, the horizontal, the 
right oblique, or the left oblique (Fig. 2b). Distractors 
were other objects of the same or different class (Fig. 1). 

Two levers were attached to the front panel of the 
monkey chair, and reinforcement was contingent upon 
pressing the right lever each time the target was presented.
Pressing the left lever was required upon presentation
of a distractor. Note (see Section 2.2) that
no feedback was given to the animals during the psy- 
chophysical data collection. A typical experimental ses- 
sion consisted of a sequence of 60 observation periods, 
each of which lasted about 25 seconds. 

Figure 3a shows the performance of one of the mon- 
keys for rotations around the vertical axis. Thirty target 
views and 60 distractor objects were used in this experi- 
ment. On the abscissa of the graph we plot the rotation 
angle and on the ordinate the experimental hit rate. The 
small squares show performance for each tested view for 
240 presentations. The solid line was obtained by a
distance-weighted least-squares smoothing of the data using
the McLain algorithm [14]. The small insets show ex- 
amples of the tested views. The monkey could identify 
correctly the views of the target around the zero view, 
while its performance dropped below chance levels for 
disparities larger than 30 degrees for leftward rotations, 
and larger than 60 degrees for rightward rotations. Per- 
formance below chance level is probably the result of the 
large number of distractors used within a session, which 
limited learning of the distractors per se. Therefore an 
object that was not perceived as a target view was read- 
ily classified as distractor. 
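
The distance-weighted least-squares smoothing used for the solid line can be sketched as a local weighted line fit evaluated at each rotation angle. This is only an illustration of the general idea: the inverse-square weight function, the `softening` constant, and the choice of a local line rather than a higher-order surface are assumptions, and the sketch should not be read as the McLain algorithm [14] as published.

```python
def dwls_smooth(xs, ys, x0, softening=100.0):
    """Value at x0 of a line fitted with weights that decay with distance from x0.

    Assumption: weights are 1 / ((x - x0)^2 + softening); `softening` keeps
    the weight finite at x = x0 and controls how local the fit is.
    """
    w = [1.0 / ((x - x0) ** 2 + softening) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


# Hypothetical hit rates (not data from the experiments) at five rotation angles.
angles = [-60.0, -30.0, 0.0, 30.0, 60.0]
hit_rates = [0.10, 0.50, 0.95, 0.50, 0.10]
curve = [dwls_smooth(angles, hit_rates, a) for a in range(-60, 61, 10)]
```

Because nearby measurements dominate each local fit, the smoothed curve follows the bell shape of the raw hit rates while damping the scatter of individual views.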

Figure 3b shows the false alarm rate, that is, the percentage
of time that a distractor object was reported as
a view of the target. The abscissa shows the distractor
number, and the squares the false alarm rate for 20 pre- 
sentations of each distractor. Recognition performance 
for rotations around the vertical, horizontal, and the two 
oblique axes (±45°) can be seen in Figure 3c. The X and
Y axes on the bottom face of the plot show the rotations
in depth, and the Z axis the experimental hit rate.

To exclude the possibility that the observed view de- 
pendency was specific to non-opaque structures lacking 
extended surface, we have also tested recognition perfor- 
mance using spheroidal, amoeba-like objects with char- 
acteristic protrusions and concavities. Thirty-six views 
of a target amoeba and 120 distractors were used in any 
given session. As illustrated in Figure 4 the monkey 
was able to generalize only for a limited number of novel 
views clustered around the views presented in the train- 
ing phase. In contrast, performance was found to be 
viewpoint-invariant when the animals were tested for ba- 
sic level classifications, or when they were trained with 
multiple views of wire-like or amoeba-like objects. Fig- 
ure 5 shows the mean performance of three monkeys for 
each of the object classes tested. Each curve was gener- 
ated by averaging individual hit rate measurements ob- 
tained from different animals for different objects within 
a class. The data in Figure 5b were collected from three 
monkeys using two spheroidal objects. The asymmetric
tuning curve denoting better recognition performance for 
rightwards rotations is probably due to asymmetric dis- 
tribution of characteristic protrusions in the two amoe- 
boid objects. Figure 5c shows the ability of monkeys 
to recognize common objects, e.g. a teapot, presented
from various viewpoints. Distractors were other common 
objects or simple geometrical shapes. Since all animals 
were already trained to perform the task independently of the
object type used as a target, no familiarization with the 
object's zero-view preceded the data collection in these 
experiments. Yet, the animals can generalize recognition 
for all tested novel views. 

For some objects the subjects were better in their abil- 
ity to recognize the target from views resulting from 
180 degree rotations. This type of behavior is evident 
in Figure 6a for one of the monkeys. As can be seen 
in the figure, performance drops for views farther than 
30° but it resumes as the unfamiliar views of the tar- 
get approach the 180° view of the target. This behavior 
was specific to those wire-like objects, for which the zero 
and 180° views appeared as mirror-symmetrical images 
of each other, due to accidental minimal self-occlusion. 
In this respect, the improvement in performance paral- 
lels the reflectional invariance observed in human psy- 
chophysical experiments [2]. Such reflectional invariance 
may also partly explain the observation that informa- 
tion about bilateral symmetry simplifies the task of 3D 
recognition by reducing the number of views required to 
achieve object constancy [30]. Not surprisingly, performance
around the 180 degree view of an object did not
improve for any of the opaque, spheroidal objects used
in these experiments. 

3.2 Generalization Field: Simulations 

Poggio and Edelman (1990) described a regularization 
network capable of performing view-independent recog- 
nition of three-dimensional wire-like objects, after initial 
training with a limited set of views of the objects [16]. 
The set size in their experiments, 80-100 views of an ob- 
ject for the entire viewing sphere, predicts a generaliza- 
tion field of about 30 degrees for any given rotation axis, 
which is in agreement with human psychophysical work 
[4,6,18,19], and with the data presented in this paper. 

Figure 7 illustrates an example of such a network and 
its output activity. A 2D view (Fig. 7a) can be rep- 
resented as a vector of some visible feature points on 
the object. In the case of wire objects, these features 
could be the x,y coordinates of the vertices, the ori- 
entation, corners, size, length, texture and color of the 
segments, or any other characteristic feature. In the ex- 
ample of Figure 7b the input vector consists of seven 
segment orientations. For simplicity we assume as many 
basis functions as the views in the training set. Each 
basis unit, $U_i$, in the "hidden-layer" calculates the distance
$\|V - T_i\|$ of the input vector $V$ from its center $T_i$,
i.e. its learned or "preferred" view, and it subsequently
computes the function $\exp(-\|V - T_i\|)$ of this distance.
The value of this function is regarded as the activity of 
the unit and it peaks when the input is the trained view 
itself. The activity of the network is conceived of as 
the weighted, linear sum of each unit's output. In the 
present simulations we assume that each unit's output 
is superimposed on Gaussian noise, $N(V, \sigma^2)$, the sigma
$\sigma$ of which was estimated from single-unit data in the
inferotemporal cortex of the macaque monkey [11]. 
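
The computation performed by such a network can be sketched in a few lines. The sketch assumes a view is represented by the projected x,y coordinates of the wire vertices (one of the feature choices listed above), uses the unit function exp(-||V - T_i||) as stated in the text, and sets all output weights to one unless others are supplied. The four-vertex object `WIRE`, the function names, and the noise level are hypothetical; they are not the objects or parameters used in the simulations.

```python
import math
import random

# Hypothetical wire object (x, y, z vertices), for illustration only.
WIRE = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0), (-1.0, 0.5, 0.0), (0.0, -1.0, -1.0)]


def view_vector(vertices, deg):
    """Feature vector of a 2D view: projected x,y of each vertex after a
    rotation by `deg` degrees around the vertical (y) axis."""
    a = math.radians(deg)
    out = []
    for x, y, z in vertices:
        out.extend((x * math.cos(a) + z * math.sin(a), y))
    return out


def unit_activity(v, template):
    """Activity of one basis unit: exp(-||V - T_i||)."""
    return math.exp(-math.dist(v, template))


def network_output(v, templates, weights=None, noise_sigma=0.0, rng=None):
    """Weighted linear sum of unit activities, each with optional Gaussian noise."""
    weights = weights or [1.0] * len(templates)
    total = 0.0
    for c, t in zip(weights, templates):
        activity = unit_activity(v, t)
        if noise_sigma > 0.0:
            activity += rng.gauss(0.0, noise_sigma)  # assumed noise level
        total += c * activity
    return total


templates = [view_vector(WIRE, a) for a in (0, 60, 120, 180)]
tuning = [unit_activity(view_vector(WIRE, a), templates[0]) for a in range(-90, 91, 10)]
noisy = network_output(view_vector(WIRE, 30), templates,
                       noise_sigma=0.05, rng=random.Random(0))
```

The list `tuning` traces the bell-shaped response of the unit centered on the zero view: it equals one at the trained view and falls off as the object is rotated away from it.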

The four plots in Figure 7c show the output of each 
RBF unit when presented with views generated by rotations
around the vertical axis. Units $U_1$ through $U_4$
are centered on the 0, 60, 120, and 180 degree views of 
the object respectively. The abscissa of the plots shows 
the rotation angle and the ordinate the unit's output 
normalized to its response at its center. Note the bell-shaped
response of each unit as the target object is rotated
away from its familiar attitude. The output of each
unit can be highly asymmetric around the center since 
the independent variable in the plots (rotation angle) is 
different from the argument of the exponential function. 
Figure 7d shows the total activity of the network under 
"zero" noise conditions. The thick, gray line on the left 
plot illustrates the network's output when the input is 
any of the 36 tested target views. The right plot shows 
its mean activity for any of the 36 views of each of the 60 
distractors. The thick, black lines in Figures 7b, c, and d 
show the representation and the activity of the same net- 
work when trained with only the zero view, simulating 
the actual psychophysical experiments described above. 



To directly compare the network performance with the 
psychophysical data described above we used the same 
wire objects used in our first experiment (Generalization 
Fields), and applied a decision theoretic analysis on the 
network's output [7]. In Figure 8a the curve $f_T(X)$, to
the right, represents the distribution of network activities
that occur on those occasions in which the input
is a view of the target. Accordingly, the curve $f_D(X)$,
to the left, represents the distribution of activities when
the input is a given distractor. The abscissa of the graph
represents stimulus strength, which increases for increasing
familiarity of the object, that is for views nearer to
the trained view. Taken as an ideal observer's operation,
the network's decision to respond "old" (target) or
"new" (distractor) depends on an adopted decision criterion
$X_c$. The gray area on the right of $X_c$ represents the
a posteriori probability of the network correctly identifying
a target, and it is denoted with $P(T|T)$, while the
dark cross-hatched area on the right of $X_c$ represents
the probability $P(T|D)$ of a false alarm. On the left
of $X_c$, the area marked with horizontal lines gives the
probability of a correct rejection, and the area with vertical
lines represents the probability of failing to recognize
the target. As the cutoff point $X_c$ runs through its possible
values, it generates a curvilinear relation between
$P(T|T)$ and $P(T|D)$ (Fig. 8b) known as the Receiver
Operating Characteristic (ROC) curve. The area underneath
this curve has been shown to amount to the
percentage correct performance of an ideal observer in
a two-alternative forced-choice (2AFC) task [7] (pages
45-47). In this model, performance depends solely on
the distance $d'$ between the means of the $f_T(X)$ and
$f_D(X)$ distributions, revealing the actual sensitivity of
the recognition system. The distance $d'$ is determined
type of analysis is that the events leading to an "old" or 
"new" response are normally distributed. Therefore, the 
selection of the vertices of the wire-like objects was con- 
strained to ensure that the activity of the network across 
the set of different distractors was distributed normally 
(Fig. 8c). 
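
The decision-theoretic analysis above can be made concrete with a short sketch that sweeps a criterion across two sets of activities, builds the ROC curve, and integrates it. The function names and the synthetic activities are assumptions for illustration; they are not the network outputs analyzed in the paper. For two normal distributions of equal variance, the area under the ROC curve equals the standard normal cumulative probability at d'/sqrt(2), which the last lines use as a consistency check.

```python
import math
from statistics import NormalDist, mean, variance


def roc_points(target_acts, distractor_acts):
    """Sweep the criterion from high to low; return (false alarm, hit) pairs."""
    points = [(0.0, 0.0)]
    for criterion in sorted(set(target_acts) | set(distractor_acts), reverse=True):
        hit = sum(a >= criterion for a in target_acts) / len(target_acts)
        fa = sum(a >= criterion for a in distractor_acts) / len(distractor_acts)
        points.append((fa, hit))
    return points


def area_under(points):
    """Trapezoidal area under an ROC curve given as ordered (fa, hit) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def d_prime(target_acts, distractor_acts):
    """Separation of the two means in pooled standard-deviation units."""
    pooled_sd = math.sqrt((variance(target_acts) + variance(distractor_acts)) / 2.0)
    return (mean(target_acts) - mean(distractor_acts)) / pooled_sd


# Synthetic activities (assumed values), drawn from two unit-variance normals.
targets = NormalDist(1.5, 1.0).samples(500, seed=1)
distractors = NormalDist(0.0, 1.0).samples(500, seed=2)
auc = area_under(roc_points(targets, distractors))
auc_from_d_prime = NormalDist().cdf(d_prime(targets, distractors) / math.sqrt(2.0))
```

The empirical area makes no distributional assumption, whereas the value derived from d' is valid only when the normality assumption stated above holds; agreement between the two is a quick check on that assumption.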



The white bars in Figure 9a show the distribution of 
the network activity when the input was any of the 60 
distractor wire objects. Black bars represent the activ- 
ity distribution for a given target view (-50, -30, 0, 30, 
and 50 degrees). Complete ROC curves for views gener- 
ated by leftward and rightward rotations are illustrated 
in Figures 9b and c respectively. Figure 9d shows the 
performance of the network as an observer in a 2AFC 
task. Open squares represent the area under the cor- 
responding ROC curve, and the gray, thick line shows 
modeling of the data with a Gaussian function computed
using the Quasi-Newton minimization technique. 



3.3 Generalization Field: Psychophysics 

The purpose of these experiments was to generate psy- 
chometric curves that could be used for comparing the 
psychophysical, physiological, and computational data 
in the context of the above task. One way to generate 
ROC curves in psychophysical experiments is to vary 
the a priori probability of signal occurrence, and instruct
the observer to maximize the percentage of correct responses.
Since the training of the monkeys was designed
to maximize the animal's correct responses, changing
the a priori probability of target occurrence did induce
a change in the animal's decision criterion as is evident 
in the variation of hits and false alarms in each curve of 
Figures 10a and b.
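
Why a change in the a priori probability shifts the decision criterion can be seen in the equal-variance Gaussian model of signal detection theory. There, an observer who maximizes the percentage of correct responses places the criterion where the likelihood ratio equals P(distractor)/P(target). The sketch below works this out; the function name and the value d' = 1.5 are assumptions for illustration and do not describe the monkeys' measured sensitivity.

```python
import math
from statistics import NormalDist


def ideal_operating_point(d_prime, p_target):
    """(false alarm, hit) of an observer maximizing percent correct.

    Distractor activity ~ N(0, 1), target activity ~ N(d', 1). The optimal
    criterion satisfies likelihood ratio = (1 - p_target) / p_target, which
    gives X_c = d'/2 + ln(beta) / d'.
    """
    beta = (1.0 - p_target) / p_target
    criterion = d_prime / 2.0 + math.log(beta) / d_prime
    hit = 1.0 - NormalDist(d_prime, 1.0).cdf(criterion)
    false_alarm = 1.0 - NormalDist(0.0, 1.0).cdf(criterion)
    return false_alarm, hit


# Four operating points, one per a priori target probability.
points = [ideal_operating_point(1.5, p) for p in (0.2, 0.4, 0.6, 0.8)]
```

As the target becomes more probable the criterion moves toward lower activities, so both the hit rate and the false alarm rate rise and the four points trace out one ROC curve.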

The data were obtained by setting the a priori probability
of target occurrence in a block of observation periods
to 0.2, 0.4, 0.6, or 0.8. Figures 10a and b show
ROC curves for leftward and rightward rotations respec- 
tively. Each curve is created from the four pairs of hit 
and false alarm rates obtained for one given target view. 
All target views were tested using the same set of distrac- 
tors. The percentage-correct performance of the monkey 
is plotted in Figure 10c. Each filled circle represents the 
area under the corresponding ROC curve in Figures 10a 
and b. The thick, gray line shows modeling of the data 
with a Gaussian function. Note the similarity between
the monkey's performance and the simulated data (thin 
gray line). 

3.4 Interpolation between two trained views 

A network, such as that in Figure 7, represents an object 
by a set of 2D views, the templates, and when the ob- 
ject's attitude changes, the network generalizes through 
nonlinear interpolation. In the simple case, in which 
the number of basis functions is taken to be equal to the 
number of views in the training set, interpolation depends
on the $c_i$ and $\sigma$ of the basis functions, and on the
disparity between the training views. Furthermore, unlike
schemes based on linear combination of 2D views [29], 
the non-linear interpolation model predicts recognition 
of novel views beyond the above measured generalization 
field to occur for only those views situated between the 
templates. 
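
The prediction can be illustrated with a two-template version of the network. The sketch assumes unit weights, the unit function exp(-||V - T_i||), views encoded as projected x,y vertex coordinates, and a hypothetical four-vertex object; none of these values come from the experiments. It compares two novel views that are equally far (60°) from the nearest trained view, one lying between the templates and one lying outside them.

```python
import math

# Hypothetical wire object (x, y, z vertices), for illustration only.
WIRE = [(1.0, 0.0, 0.0), (0.0, 1.0, 1.0), (-1.0, 0.5, 0.0), (0.0, -1.0, -1.0)]


def view_vector(vertices, deg):
    """Projected x,y coordinates after rotation around the vertical axis."""
    a = math.radians(deg)
    out = []
    for x, y, z in vertices:
        out.extend((x * math.cos(a) + z * math.sin(a), y))
    return out


def summed_activity(deg, trained_degs, vertices=WIRE):
    """Sum of unit activities exp(-||V - T_i||) over all trained views."""
    v = view_vector(vertices, deg)
    return sum(math.exp(-math.dist(v, view_vector(vertices, t)))
               for t in trained_degs)


trained = (0, 120)
between = summed_activity(60, trained)    # novel view between the templates
outside = summed_activity(-60, trained)   # equally far from 0°, but outside
```

For this object `between` exceeds `outside`, because a view between the templates receives substantial input from both units, whereas a view outside them is driven almost entirely by one.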

To test this prediction experimentally, the ability of 
the monkeys to generalize recognition to novel views 
was examined after training the animals with two suc- 
cessively presented views of the target 120° and 160° 
apart. 

The results of such an experiment are illustrated in 
Figures 11a and b. The monkey was initially trained to 
identify the 0° and 120° views of a wire-like object among 
120 distractor objects of the same class. During this pe- 
riod the animal was given feedback as to the correctness 
of the response. Training was considered complete when 
the monkey's hit rate was consistently above 95%, false 
alarm rate remained below 10%, and the dispersion coefficient
of reaction times was minimized. A total of 600
presentations were required to achieve the above condi- 
tions, after which testing and data collection began. 

During a single observation period, the monkey was 
first shown the familiar 0° and 120° views of the ob- 
ject, and then presented sequentially with 10 stimuli that 
could be either target or distractor views. Within one 
experimental session each of the 36 tested target views 
was presented 30 times. The spikes on the YZ plane of 
the plot show the hit rate for each view generated by 
rotations around the Y axis. The solid line represents a 
distance-weighted, least-squares smoothing of the data 
using the McLain algorithm [14]. The results show that 
interpolation between familiar views may be the only 
generalization achieved by the monkey's recognition sys- 
tem. No extrapolation is evident with the exception of 
the slightly increased hit rate for views around the -120°
view of the object, which approximately corresponds to a
180 degree rotation of some of the interpolated views. 

The contour plot summarizes the performance of the 
monkey for views generated by rotating the object 
around the horizontal, vertical, and the two oblique axes. 
Thirty six views were tested for each axis, each presented 
30 times. The results show that the ability of the monkey 
to recognize novel views is limited to the space spanned 
between the two trained views as predicted by the model 
of nonlinear approximation. 

The experiment was repeated after briefly training the 
monkey to recognize the 60° view of the object. Dur- 
ing the second "training period" the animal was simply 
given feedback as to the correctness of the response for 
the 60° view of the object. The results can be seen in 
Figure 11b. The animal was able to recognize all views
between the 0° and 120° views. Moreover, performance 
improved significantly around the -120° view.

4 Discussion 

The main findings of this study are (a) that recogni- 
tion of a novel, three-dimensional object depends on the 
viewpoint from which the object is encountered, and (b) 
that perceptual object-constancy can be achieved by fa- 
miliarization with a limited number of views. 

The first demonstration of strong viewpoint depen- 
dence in the recognition of novel objects was that of Rock 
and his collaborators [18,19]. These investigators exam- 
ined the ability of human subjects to recognize three- 
dimensional, smoothly curved wire-like objects seen from 
one viewpoint, when encountered from a different atti- 
tude and thus having a different 2D projection on the 
retina. Although their stimuli were real objects (made 
from 2.5mm wire), and provided the subject with full 
3D information, there was a sharp drop in recognition 
for view disparities larger than approximately 30 degrees.
In fact, as subsequent investigations showed, subjects 
could not even imagine how wire objects look when ro-
tated, despite instructions for visualizing the object from
another viewpoint [31]. Similar results were obtained in
later experiments by Edelman and Bülthoff (1992) with
computer-rendered, wire-like objects presented stereo- 
scopically or as flat images [4,6]. 

In this paper we provide evidence of similar view- 
dependency of recognition for the nonhuman primate. 
Monkeys were indeed unable to recognize objects ro- 
tated more than approximately 40 degrees of visual angle 
from a familiar view. These results are hard to recon- 
cile with theories postulating object-centered representa- 
tions. Such theories predict uniform performance across 
different object views, provided 3D information is avail- 
able to the subject at the time of the first encounter. 
Therefore, one question calling for discussion is whether 
or not information about the object's structure was avail- 
able to the monkeys during the learning phase of these 
experiments. 

First of all, wires are visible in their entirety since, 
unlike most opaque natural objects in the environment, 
regions in front do not substantially occlude regions in 
back. Second, the objects were computer-rendered with 
appropriate shading and were presented in slow oscilla- 
tory motion. The motion parallax effects produced by 
such motion yield vivid and accurate perception of the 
3D structure of an object or surface [3,20]. In fact, psy- 
chometric functions showing depth modulation thresh- 
olds as a function of spatial frequency of 3D corruga- 
tions are very similar for surfaces specified through ei- 
ther disparity or motion parallax cues [21-23]. Further- 
more, experiments on monkeys have shown that nonhu- 
man primates, too, possess the ability to see structure 
from motion [26] in random-dot kinematograms. Thus, 
during the learning phase of each observation period, in- 
formation about the three-dimensional structure of the 
target was available to the monkey by virtue of shading, 
the kinetic depth effect, and minimal self-occlusion. 

Could the view-dependent behavior of the animals be 
a result of the monkeys' failing to understand the task? 
The monkey could indeed recognize a two-dimensional 
pattern as such, without necessarily perceiving it as a 
view of an object. Correct performance around the fa- 
miliar view could then be simply explained as the inabil- 
ity of the animal to discriminate adjacent views. Several 
lines of argument refute such an interpretation of the
obtained results. First, the animals easily generalized
recognition to all novel views of common objects. More- 
over, when the wire-like objects had prominent charac- 
teristics, such as one or more sharp angles, or a closure, 
the monkeys were able to perform in a view-invariant 
fashion. Second, when two views of the target were pre- 
sented in the training phase the animals interpolated, 
often with 100% performance, for any view between the 
two trained views. 

Third, for many wire-like objects the animal's recogni- 
tion was found to exceed criterion performance for views 
that resembled "mirror-symmetrical", two-dimensional
images of each other, due to accidental lack of self-
occlusion. Invariance for reflections has been reported
earlier in the literature [2], and it clearly represents a
form of generalization. Finally, human subjects who
were tested for comparison using the same apparatus 
exhibited recognition performance very similar to that 
of the tested monkeys. 

Thus, it appears that monkeys, just like human sub- 
jects, show rotational invariance for familiar, basic-level 
objects, but they fail to generalize recognition at the sub- 
ordinate level, when fine, shape-based discriminations 
are required to recognize an object. Interestingly, train- 
ing with a limited number of views (about 10 views for 
the entire viewing sphere) was sufficient for all the mon- 
keys tested to achieve view-independent performance. 

Recognition based entirely on fine shape discrimina-
tions is not uncommon in daily life. We are certainly able
to recognize modern sculptures, mountains or cloud for-
mations. The largely view-independent, basic-level recog-
nition exhibited by adults may be the result of learning
of certain irreducible shapes early in life. Even those the- 
ories suggesting that recognition involves the indexing of 
a limited number of volumetric components [1] and the 
detection of their relationships have to face the problem 
of learning components that cannot be further decom- 
posed. In other words, we still have to achieve represen- 
tations of some elementary object forms that transcend 
the special viewpoint of the observer. Such representa- 
tions usually rely on shape coding that is very similar to 
that required for the subordinate level of recognition. 

5 Conclusions 

Our results provide evidence supporting viewer-centered 
object representation in the primate, at least for sub- 
ordinate level classifications. While monkeys, just like 
human subjects, show rotational invariance for familiar,
basic-level objects, they fail to generalize recognition for
rotations of more than 30 to 40 degrees when fine, shape-
based discriminations are required to recognize an ob-
ject. The psychophysical performance of the animals is
consistent with the idea that view-based approximation 
modules synthesized during training may indeed be one 
of several algorithms the primate visual system uses for 
object recognition. 

The visual stimuli used in these experiments were
designed to provide accurate descriptions of the three-
dimensional structure of the objects. Therefore our find-
ings are unlikely to be the result of insufficient depth
information in the two-dimensional images for building
a three-dimensional representation. Rather, they suggest
that construction of viewpoint-invariant representations
may not be possible for a novel object. Thus the
viewpoint-invariant performance typically observed when
recognizing familiar objects may eventually be the result of
a sufficient number of two-dimensional representations,
created for each experienced viewpoint. The number of
viewpoints is likely to depend on the class of an object
and may reach a minimum for novel objects that belong 
to a familiar class, thereby sharing sufficiently similar 
transformation properties with the other class members. 
Recognition of an individual new face seen from one sin- 
gle view may be such an example. 

Figure 1: Example of three stimulus objects used in the experiments on object recognition. (a) Wire-like, (b)
spheroidal, and (c) common objects were rendered by a computer and displayed on a color monitor. The middle column of
the 'Targets' shows the view of each object as it appeared in the learning phase of an observation period. This view was
arbitrarily called the zero view of the object. Columns 1, 2, 4, and 5 show the views of each object when rotated -48, -24,
24, and 48 degrees about a vertical axis, respectively. The rightmost column shows an example of a distractor object for each
object class. Sixty to 120 distractor objects were used in each experiment.

Figure 2: Experimental paradigm. (a) Description of the task. An observation period consisted of a learning phase, within
which the target object was presented oscillating ±10° around a fixed axis, and a testing phase during which the subjects were
presented with up to 10 single, static views of either the target or the distractors. The small insets in this and the following
figures show examples of the tested views. The subject had to respond by pressing one of two levers, right for the target, and
left for the distractors. (b) Description of the stimulus space. The viewpoint coordinates of the observer with respect to the
object were defined as the longitude and the latitude of the eye on a virtual sphere centered on the object. Viewing the object
from an attitude a, e.g. -60° with respect to the zero view, corresponded to a 60° rightwards rotation of the object around
the vertical axis, while viewing from an attitude b amounted to a rightwards rotation around the -45° axis. Recognition was
tested for views generated by rotations around the vertical (Y), horizontal (X), and the two oblique (±45°) axes lying on the
XY plane.

Figure 3: Recognition performance as a function of rotation in depth for wire-like objects. Data from the monkey
B63A. (a) The abscissa of the graph shows the rotation angle and the ordinate the hit rate. The small squares show
performance for each tested view for 240 presentations. The solid lines were obtained by a distance-weighted least squares
smoothing of the data using the McLain algorithm. When the object is rotated more than about 30 to 40 degrees away,
performance falls below 40%. (b) False alarms for the 120 different distractor objects. The abscissa shows the distractor
number, and the squares show the false alarm rate for 20 distractor presentations. (c) Recognition performance for rotations around
the vertical, horizontal, and the two oblique axes (±45°).

Figure 4: Recognition performance as a function of rotation in depth for spheroidal objects. Data from the monkey
B63A. Conventions as in Figure 3.

Figure 5: Mean recognition performance as a function of rotation in depth for different types of objects. (a) and
(b) show data averaged from three monkeys for the wire and spheroidal objects. (c) Performance of the monkey S5396 for
common-type objects. Conventions as in Figure 3(a).



[Plot panels: (a) abscissa "Rotation Around Y Axis", -180 to 180; (b) abscissa "Distractor ID", 1 to 120; ordinates 20 to 100.]



Figure 6: Improvement of recognition performance for views generated by 180° rotations of wire-like objects. Data
from monkey S5396. Conventions as in Figure 3(a). This type of performance was specific to those wire-like objects whose
zero and 180° views resembled mirror-symmetrical two-dimensional images due to accidental lack of self-occlusion.

Figure 7: A network for object recognition. (a) A view is represented as a vector of some visible feature points on the
object. On the wire objects these features could be the x, y coordinates of the vertices, the orientation, size, length and color
of the segments, etc. (b) An example of an RBF network in which the input vector consists of the segment orientations. For
simplicity we assume as many basis functions as the views in the training set, in this example four views (0, 60, 120, and 180
degrees). Each basis unit, $U_i$, in the "hidden layer" calculates the distance $\|V - T_i\|$ of the input vector $V$ from its center $T_i$,
i.e. its learned or "preferred" view, and it subsequently computes the function $\exp(-\|V - T_i\|)$ of this distance. The value of
this function is regarded as the activity of the unit, and it peaks when the input is the trained view itself. The activity of the
network is conceived as the weighted, linear sum of each unit's output superimposed on Gaussian noise ($\epsilon \in N(V, \sigma_u)$). Thick
lines show an instance of the network that was trained only with the zero view of the target. (c) Plots 1-4 show the output
of each RBF unit, under "zero-noise" conditions, when the unit is presented with views generated by rotations around the
vertical axis. (d) Network output for target and distractor views. The thick gray line on the left plot depicts the activity of
the network trained with four views and the black line with one view (the zero view). The right plot shows the network's output
for 36 views of 60 distractors.
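
The network in the Figure 7 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function `view_vector` is a hypothetical stand-in for the real segment-orientation features (which the source does not tabulate), the weights are assumed equal, and the noise term is omitted ("zero-noise" conditions). Each unit computes exp(-||V - T||) for its own center T, as the caption states.

```python
import math

def rbf_unit(v, center):
    """Activity exp(-||v - center||) of one basis unit, as in the caption."""
    return math.exp(-math.dist(v, center))

def network_output(v, centers, weights):
    """Weighted linear sum of the unit activities (noise term omitted)."""
    return sum(w * rbf_unit(v, c) for w, c in zip(weights, centers))

def view_vector(angle_deg):
    """Hypothetical 2-D feature vector for a view (illustration only)."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

centers = [view_vector(a) for a in (0, 60, 120, 180)]  # four trained views
weights = [1.0] * len(centers)                         # assumed equal weights

at_trained = rbf_unit(view_vector(60), centers[1])     # unit at its own center
four_view = network_output(view_vector(90), centers, weights)
one_view = network_output(view_vector(90), centers[:1], weights[:1])

print(at_trained, four_view, one_view)
```

A unit's activity peaks at its own trained view, and the network holding four views responds more strongly to the untrained 90° view than the network holding only the zero view, which is the qualitative contrast drawn in panel (d).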

Figure 8: Decision theoretic analysis of the network output. (a) The curve $f_T(X)$, to the right, represents the distribution
of network activities that occur on those occasions when the input is a view of the target. The curve $f_D(X)$, to the left,
represents the distribution of activities when the input is a given distractor. The network's decision whether an input is a
target or a distractor depends on the decision criterion $X_c$. The gray area on the right of $X_c$ represents the probability
$P(T|T)$ of the network correctly identifying a target and the dark dotted area on the right of $X_c$ represents the probability
$P(T|D)$ of a false alarm. On the left of $X_c$, the area marked with horizontal lines gives the probability of correct rejections,
and the area with vertical lines represents the probability of failing to recognize a target. (b) As $X_c$ runs through its possible
values it generates a curvilinear relation between $P(T|T)$ and $P(T|D)$ (thick black line), the area underneath which has been
shown to amount to the criterion-independent percentage of correct responses of an ideal observer in a 2AFC task. The latter
discriminability measure depends only on the distance $d'$ between the distractor and target distributions. (c) Multiple normal
probability density functions can be approximated by a single Gaussian distribution, indicated by the thick gray line, when
the means of the distributions are separated by a fraction of the standard deviation.
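
The criterion sweep described in the Figure 8 caption can be reproduced numerically. The sketch below assumes equal-variance normal distributions for target and distractor activities (an assumption made here for illustration; the means and standard deviation are hypothetical). It traces the ROC by moving the criterion, integrates the area with the trapezoid rule, and compares it with the standard equal-variance result that the area equals the normal cumulative distribution evaluated at d'/sqrt(2).

```python
import math

def normal_sf(x, mean, sd):
    """P(X > x) for a normal distribution (area to the right of x)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def roc_area(mean_t, mean_d, sd, steps=2000):
    """Trapezoidal area under the ROC traced by sweeping the criterion."""
    lo = min(mean_t, mean_d) - 8.0 * sd
    hi = max(mean_t, mean_d) + 8.0 * sd
    area, prev_fa, prev_hit = 0.0, 1.0, 1.0   # criterion far left: always "target"
    for k in range(steps + 1):
        xc = lo + (hi - lo) * k / steps
        hit = normal_sf(xc, mean_t, sd)       # P(T|T) at this criterion
        fa = normal_sf(xc, mean_d, sd)        # P(T|D) at this criterion
        area += (prev_fa - fa) * (prev_hit + hit) / 2.0
        prev_fa, prev_hit = fa, hit
    return area

d_prime = 1.0                                  # hypothetical separation
numeric = roc_area(d_prime, 0.0, 1.0)
closed_form = 0.5 * (1.0 + math.erf(d_prime / 2.0))  # Phi(d'/sqrt(2))

print(f"numeric: {numeric:.4f}  closed form: {closed_form:.4f}")
```

When the two distributions coincide (d' = 0) the same routine returns an area of one half, which corresponds to chance performance in the 2AFC task.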

Figure 9: Receiver operating characteristic (ROC) curves and performance of the RBF network. (a) White bars show
the distribution of the network activity when the input was any of the 60 distractor wire objects. Black bars represent the
activity distribution for a given target view (-50, -30, 0, 30, and 50 degrees). (b) Receiver operating characteristic curves for
views generated by leftward rotations. (c) Receiver operating characteristic curves for views generated by rightward rotations.
(d) Network performance as an observer in a 2AFC task. Filled squares represent the activity of the network. The solid line
is the distance-weighted least squares smoothing of the data for all tested views. The dashed line shows chance performance.

Figure 10: ROC curves from one monkey in the old-new task used to study recognition. The data were obtained by
varying the a priori probability of target occurrence in blocks of observation periods. The values used in this experiment were
0.2, 0.4, 0.6, and 0.8. (a) Each curve corresponds to a set of hit and false alarm rate values measured for a rightward rotation.
Rotations were done in 15° steps. (b) Same as in (a), but for leftward rotations. (c) Recognition performance for different
object views. Each filled circle represents the area under the corresponding ROC curve. The solid line models the data with
a single Gaussian function.
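
The area under an empirical ROC curve, as plotted in panel (c) of Figure 10, can be estimated from a handful of hit and false alarm rates. The following is a sketch only: the four (false alarm, hit) pairs are hypothetical values invented for illustration, not data from the experiment, and the trapezoid rule with the end points (0, 0) and (1, 1) added is one common estimator, not necessarily the one the authors used.

```python
def empirical_roc_area(points):
    """Trapezoidal area under an ROC through (false-alarm, hit) pairs."""
    curve = [(0.0, 0.0)] + sorted(points) + [(1.0, 1.0)]
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))

# Hypothetical (false-alarm rate, hit rate) pairs, one per target prior
# (0.2, 0.4, 0.6, 0.8); these numbers are illustrative, not measured.
example = [(0.05, 0.40), (0.15, 0.65), (0.30, 0.80), (0.50, 0.92)]
area = empirical_roc_area(example)

print(f"area under ROC: {area:.3f}")
```

A single point on the diagonal, such as (0.5, 0.5), yields an area of 0.5, the value expected when the view cannot be discriminated from the distractors.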

Figure 11: Interpolation between two trained views. (a) In the learning phase the monkey was presented sequentially with
the 0° and 120° views of a wire-like object, and subsequently tested with 36 views around any of the four axes (horizontal,
vertical and the two obliques). The spikes normal to the contour plot show the hit rate for rotations around the Y axis. Note
the somewhat increased hit rate for views around the -120° view. The contour plot shows the performance of the monkey for views
generated by rotating the object around any of the horizontal, vertical, and the two oblique axes. (b) Repetition of the
same experiment after briefly training the monkey with the 60° view of the wire object. The animal can now recognize any
view in the range of -30° to 140° as well as around the -120° view. As predicted by the RBF model, generalization is limited
to views between the two trained views.